Dealing with unknown words in statistical machine translation

Authors

  • João Pedro Carlos Gomes da Silva
  • Luísa Coheur
  • Ângela Costa
  • Isabel Trancoso
Abstract

In Statistical Machine Translation, words that were not seen during training are unknown words, that is, words that the system does not know how to translate. In this paper we contribute to this research problem by exploiting the orthographic cues that words provide. We report a study of the impact of word distance metrics on cognate detection and, in addition, of the possibility of obtaining translations of unknown words through Logical Analogy. Our approach is tested on the translation of corpora from Portuguese to English (and vice versa).
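As a rough illustration of how a word distance metric can flag cognate candidates for an unknown word, the sketch below scores target-vocabulary words by normalized Levenshtein similarity. The abstract does not say which metrics the authors evaluated, so the choice of Levenshtein distance, the function names (`levenshtein`, `similarity`, `best_cognate`) and the 0.7 threshold are all assumptions made for illustration, not the paper's method.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_cognate(unknown: str, target_vocab: list[str], threshold: float = 0.7):
    """Return the most similar target word, or None if nothing passes the
    (hypothetical) threshold or the vocabulary is empty."""
    best = max(target_vocab, key=lambda w: similarity(unknown, w), default=None)
    if best is not None and similarity(unknown, best) >= threshold:
        return best
    return None
```

With a toy vocabulary, `best_cognate("problema", ["problem", "house"])` picks `"problem"` (one edit over eight characters), while a word with no orthographically close counterpart, such as `"cachorro"` against `["dog", "house"]`, yields `None` and would have to be handled some other way.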


Similar resources

Overcoming Vocabulary Sparsity in MT Using Lattices

Source languages with complex word-formation rules present a challenge for statistical machine translation (SMT). In this paper, we take on three facets of this challenge: (1) common stems are fragmented into many different forms in training data, (2) rare and unknown words are frequent in test data, and (3) spelling variation creates additional sparseness problems. We present a novel, lightweig...


A new model for persian multi-part words edition based on statistical machine translation

Multi-part words in English are hyphenated, with the hyphen separating the different parts. Persian has multi-part words as well. Under Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use the space character instead. This common incorrect use of the space leads to some s...


Statistical Machine Translation for Twitter

We consider the problem of translating short messages (Tweets) using Europarl as a starting-point. After highlighting some of the domain differences between Europarl and Twitter, we show that for German-English translation, we can improve performance from a baseline BLEU score of 25.58 to 53.45. By far and away the single most important improvement is passing-through unknown words (which are ma...


Handling Unknown Words in Statistical Machine Translation from a New Perspective

Unknown words are one of the key factors that drastically affect translation quality. Traditionally, nearly all related research has focused on obtaining translations of the unknown words in different ways. In this paper, we propose a new perspective on handling unknown words in statistical machine translation. Instead of making a great effort to find the translation of unknown words, th...


Analogical translation of unknown words in a statistical machine translation framework

In this paper we address the problem of translating unknown words in a statistical machine translation framework. In data-driven machine translation, words that are not seen in the data may not be translated and are either discarded or left as is in the output. They are referred to as unknown words. The unknown word problem increases when the available bilingual data is scarce. In order to addre...




Publication date: 2012